Increased Computer Peripheral Throughput
By Using Data Available Withholding

Cross-Reference to Related Applications 

The following patent applications, all assigned to the assignee of this application,
describe related aspects of the arrangement and operation of multiprocessor computer systems
according to this invention or its preferred embodiment.

U.S. patent application serial number / by T. B. Berg et al.
(BEA920000017US1) entitled "Method And Apparatus For Using Global Snooping To
Provide Cache Coherence To Distributed Computer Nodes In A Single Coherent System" was
filed on January __, 2002.

U.S. patent application serial number / by S. G. Lloyd et al.
(BEA920000019US1) entitled "Transaction Redirection Mechanism For
Handling Late Specification Changes And Design Errors" was filed on January __, 2002.

U.S. patent application serial number / by T. B. Berg et al.
(BEA920000020US1) entitled "Method And Apparatus For
Multi-path Data Storage And Retrieval" was filed on January __, 2002.

U.S. patent application serial number / by W. A. Downer et al.
(BEA920000021US1) entitled "Hardware Support For Partitioning A Multiprocessor System
To Allow Distinct Operating Systems" was filed on January __, 2002.

U.S. patent application serial number / by T. B. Berg et al.
(BEA920000022US1) entitled "Distributed Allocation Of System Hardware Resources
For Multiprocessor Systems" was filed on January __, 2002.

U.S. patent application serial number / by W. A. Downer et al.
(BEA920010030US1) entitled "Masterless Building Block Binding To Partitions" was filed on
January __, 2002.

U.S. patent application serial number / by W. A. Downer et al.
(BEA920010031US1) entitled "Building Block Removal From Partitions" was filed on
January __, 2002.

U.S. patent application serial number / by W. A. Downer et al.
(BEA920010041US1) entitled "Masterless Building Block Binding To Partitions Using
Identifiers And Indicators" was filed on January __, 2002.



Background Of The Invention
Technical Field 

The present invention relates generally to computer data cache schemes, and more
particularly to a method and apparatus for simultaneously processing data writes
from a standard peripheral computer interface device when having multiple data processors in a
system utilizing non-uniform memory access.






Description of the Related Art 

In computer system designs utilizing more than one processor operating simultaneously
in a coordinated manner, data handling from peripheral component interface (PCI) devices is
controlled in a fashion that provides only for single transactions to be processed at one time or
in strict order, if multiple data output commands are received from one such device in a
system utilizing any number of such devices. In a multiprocessor system which uses non-uniform
memory access, where system memory may be distributed across multiple
controllers in a single system, this may limit performance.

A PCI device, such as a hard disk controller, may issue a write command. Any
multiple processor address control system will send an "invalidate" indication of the data line to
be written to all caching agents or processors. One method of handling such invalidates in the
past is that a controller waits to receive acknowledgments that the data invalidate has been
received and then makes that data line available for writing. The controller then sends an
invalidate of a flag line for that data line, which was just made available for write. In the prior
art, many such controllers will wait to receive acknowledgments from all memory sources prior
to proceeding and then will accept the data from the PCI device attempting to write to memory.
After such a device writes to the memory management device, that device makes the flag line
available. Usually, controllers found in the prior art post write commands in the same
order as the invalidate commands are issued on a particular PCI bus.

All of this has the effect of slowing down system speed and therefore performance, 
because of component latency and because the ability of the system to process multiple data 
lines while waiting for invalidate indicators from other system processors is not fully utilized.

Summary Of The Invention

A first aspect of this invention is a method for controlling the sequencing of data writes
from peripheral devices in a multiprocessor computer system. The computer system includes
groups of processors, with each processor group interconnected to the other processor groups.
In the method, a first data write is issued by a peripheral device in the system, queued, and
checked for completion. The sequence order of overlapping write data is tracked. Both the
first and the second write data are processed substantially simultaneously using one or more of
the memory systems, but the processed second write data is output only after completion of
the first data write. By starting the processing of subsequent data writes before completing
previous data writes, the method of the invention increases overall performance of the system.

Another aspect of this invention is found in a multiprocessor computer system itself.
The system has two or more groups of one or more processors each. The system also has a
peripheral device capable of initiating first and second data writes producing first and second
write data, respectively, and a queue capable of sequentially ordering the data writes. A
completion indicator determines completion of the first data write, and a sequencer tracks
overlapping of the write data, both in response at least in part to the write data. The system
includes storage for the first and second write data, and output for the first and second write
data which responds at least in part to the sequencer and the completion indicator. The storage
for the second write data is capable of accepting the second write data before completion of
the first data write, but the output for the second write data is capable of outputting the second
write data only after completion of the first data write.

Other features and advantages of this invention will become apparent from the following
detailed description of the presently preferred embodiment of the invention, taken in
conjunction with the accompanying drawings.

Brief Description Of The Drawings 
Fig. 1 is a functional block diagram of the inbound data block which buffers data from
the PCI bus according to the preferred embodiment of the invention.

Fig. 2 is a functional block diagram of the inbound data order queue of Fig. 1, and is
suggested for printing on the first page of the issued patent.

Fig. 3 is a functional block diagram of the inbound data completion arbiter of Fig. 1.

Fig. 4A-4C is a logic timing diagram illustrating the inbound data block
during operation of the preferred embodiment.

Fig. 5 is a block diagram of a multiprocessor system having a tag and address crossbar
and a data crossbar, and incorporating the inbound data block of Fig. 1.

Fig. 6 is a block diagram of one quad processor group of the system of Fig. 5.

Detailed Description Of The Preferred Embodiment



Overview 

The preferred embodiment of this invention allows data being issued from peripheral
component interface (PCI) devices or other computer peripheral components to be almost
simultaneously processed in a parallel fashion without distortion of the data transaction timing
sequence in a multiple microprocessor computer. A PCI device can write two cache data lines,
the first cache line being called "data" and the second line being called a "flag". A memory
control device for a group of processors receives these two cache line write transactions from a
PCI bridge device interfacing between the PCI peripheral and the control system. The control
system issues a write command, but the control system does not process subsequent
transactions at that time.

The control system looks up the state of the data cache line and issues invalidates to
any processors in the system that hold a copy of that data line. At the same time the control
system signals other similar control systems associated with all other groups of processors that
the invalidates have been issued. The various control systems, each associated with a group of
processors, invalidate the data line in the processors' caches on that particular group of
processors. Such controllers send an "acknowledge" command back to the original controller
indicating that they have invalidated the proper line in response to the original controller's
request.

Overall performance of the computer system is improved because the system handles
the data for the "data" cache line but does not make that data line visible to the rest of the
system controllers. As soon as the "data" has been moved into the central memory cache
system, the controller signals a particular PCI bridge chip that it has completed the write and
has deallocated the buffer space so that it can receive more transactions. Once the originating
controller has received the indication from the central controller (comprised of a tag and
address crossbar) that the invalidates will eventually be issued, the "write flag" transaction can
now proceed with the above steps of: issuing the "write flag" command to the central controller,
having the central controller look up the states and issue invalidates, and having the other
controllers invalidate their processors and send "acknowledge" commands to the original
controller. The originating controller will receive acknowledges for both the "data" and "flag"
lines.

Acknowledges of invalidates can thus be received in any order. For example, the
acknowledges for the flag may be received before or after the acknowledges for the data
without corruption of the ordering sequence. The invention ensures that the "data" line is made
visible to the rest of the system only after all acknowledges for the "data" line have been received.
The invention also ensures that the "flag" line is made visible only after both the "data" has been
made visible and all acknowledges for the "flag" have been received.

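
The visibility rule stated above can be sketched in software as follows. This is only an illustrative model in Python, not the hardware of the preferred embodiment; the names PendingWrite and update_visibility and the acknowledge counts are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PendingWrite:
    """One cache line write awaiting invalidate acknowledges (hypothetical model)."""
    name: str
    acks_expected: int
    acks_received: int = 0
    visible: bool = False

    def all_acked(self) -> bool:
        return self.acks_received >= self.acks_expected


def update_visibility(data: PendingWrite, flag: PendingWrite) -> None:
    """Data becomes visible once fully acknowledged; flag only after data is visible."""
    if data.all_acked():
        data.visible = True
    if data.visible and flag.all_acked():
        flag.visible = True


data = PendingWrite("data", acks_expected=2)
flag = PendingWrite("flag", acks_expected=1)

# The flag's acknowledge arrives first; neither line may become visible yet.
flag.acks_received += 1
update_visibility(data, flag)
early = (data.visible, flag.visible)

# Both acknowledges for the data line arrive; data, then flag, become visible.
data.acks_received += 2
update_visibility(data, flag)
late = (data.visible, flag.visible)
```

In this sketch the flag is withheld even though its own acknowledges are complete, which is the "data available withholding" behavior the title refers to.
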
Technical Details

The present invention relates specifically to an improved data handling method for use
in a multiple processor system which utilizes a tagging and address crossbar system for use in
combination with a data crossbar system, together comprising a data processing system to
process multiple data write requests concurrently while maintaining the order of the
write requests. The system maintains transaction ordering from a given PCI bus throughout an
entire system which employs multiple processors, all of which may have access to a particular
PCI bus. The described invention is particularly useful in non-uniform memory access
(NUMA) multi-processor systems. NUMA systems partition physical memory associated with
one local group of microprocessors into locally available memory (home memory) and remote
memory or cache for use by processors in other processor groups within a system. In such
systems, apparatus which coordinates peripheral component interface (PCI) devices to control
ordering rules for processing write commands must coordinate data transfer to prevent
overwriting or out of sequence data forwarding. Namely, a series of write commands from a
PCI device must be made visible to the system in the precise order that they were issued by the
PCI device. Some tag and address crossbar systems used in multiple processor systems
cannot allow a line of data to be made visible to the system until that line has been invalidated
(i.e., an "invalidate" has been issued), thus ensuring that no processor in the system has access to
that line. This has the effect of limiting the processing speed of the system.

Fig. 5 presents an example of a typical multiprocessor system in which the present
invention may be used. Fig. 5 illustrates a multi-processor system which utilizes four separate
central control systems (control agents) 66, each of which provides input/output interfacing and
memory control for an array 64 of four Intel brand Itanium class microprocessors 62 per
control agent 66. In many applications, control agent 66 is an application specific integrated
circuit (ASIC) which is developed for a particular system application to provide the interfacing
for each microprocessor bus 76, each memory 68 associated with a given control agent 66,
F16 bus 21, and PCI input/output interface 80, along with the associated PCI bus 74 which
connects to various PCI devices.

Fig. 5 also illustrates the port connection between each tag and address crossbar 70 as
well as data crossbar 72. As can be appreciated from the block diagram shown in Fig. 5,
crossbar 70 and crossbar 72 allow communications between each control agent 66, such that
addressing information and data information can be communicated across the entire
multiprocessor system 60. Such a memory addressing system is necessary to communicate data
locations across the system and facilitate update of control agent 66 cache information
regarding data validity and required data location.

A single quad processor group 58 is comprised of microprocessors 62, memory 68,
and control agent 66. In multiprocessor systems to which the present invention relates, quad
memory 68 is usually random access memory (RAM) available to the local control agent 66 as
local or home memory. A particular memory 68 is attached to a particular control agent 66
in the entire system 60, but is considered remote memory when accessed by another quad or
control agent 66 not directly connected to a particular memory 68 associated with a particular
control agent 66. A microprocessor 62 existing in any one quad processor group 58 may
access memory 68 on any other quad processor group 58. NUMA systems typically partition
memory 68 into local memory and remote memory for access by other quad processor groups
58. The present invention enhances the entire system's ability to keep track of data writes
issued by PCI devices that access memory 68 which is located in a processor group 58
different from, and therefore remote from, a processor group 58 which has a PCI device which
issued the write.

The present invention permits a system using multiple processors with a processor
group interface control system and an address tag and crossbar system to almost completely
process subsequent PCI device issued writes before previous writes from such PCI devices
have been completely processed by the system. The present invention provides for a method in
which invalidates for a second write can be issued in parallel with the invalidates for a first write
being issued by a given PCI device. The invention ensures that the second write is not visible to
the system, that is, that no processor on a multiprocessor board can read the data from the
second write, until the invalidates from the first write have been received. In the preferred
embodiment, multiple writes issued sequentially by a PCI device may be processed in parallel
and out of sequence with the order in which the writes were issued while ensuring that the output
sequence of the writes remains identical to the order of the input writes. Fig. 6 is a different
view of the same multiprocessor system shown in Fig. 5 in a simpler view, illustrating one
processor group 58, sometimes referred to as a Quad, in relation to the other Quad 58
components as well as crossbar system 70 and 72 illustrated for simplicity as one unit in Fig. 6.

Turning now to Fig. 1, the invention will be described with reference to the functional
block diagram for the inbound data block. Inbound data block (IDB) 10 interfaces with data
organizer (DOG) 12, request completion manager (RCM) 14, transaction manager (X-MAN)
16, inbound command block (ICB) 18, and the F16 interface block 20. IDB 10 buffers the
data from F16 bus 21 to the DOG 12. IDB 10 is also responsible for maintaining the data
ordering to allow the posting of inbound transactions from the PCI bus 74 being generated by a
PCI device and delivered from each PCI input-output interface 80. F16 interface block 20
represents an initial interface inside control agent 66 to the group of four F16 buses 21. Block
20, which is associated with each F16 bus 21, pushes data into the inbound data queue (IDQ)
22 as it arrives from an individual F16 bus 21. In the preferred embodiment of the present
invention, F16 bus 21 is comprised of a proprietary design utilized by the Intel brand of
microprocessors and associated microprocessor chip sets which is known commonly as the
Intel F16 bus. As can be seen in Fig. 5, F16 bus 21 acts as a bridging bus between control
agent 66 and PCI input/output (IO) interface 80 which can connect a particular PCI device to
the system. The invention presently described may be applied to other types of data interfaces
which issue data sequentially for use by a processor system. Though the preferred embodiment
is described utilizing an Intel brand PCI bridge chip set, it should be appreciated that other
device interfaces utilizing other component bus interface systems which interconnect system
devices, such as disk drives, video cards or other peripheral components, may be utilized in
carrying out the system described herein. Each individual quad control agent 66 utilizes four of
such PCI buses 74 connected through PCI bridge chips 80 which are in turn connected to the
F16 interface block 20 contained within each individual control agent 66 via the F16 bus 21.
The four F16 buses 21 operate in a parallel fashion and simultaneously as illustrated in Fig. 5.

Each microprocessor 62 communicates to the rest of the system through its individual
processor bus 76, which communicates with its respective control agent 66, especially in the
preferred embodiment wherein each group 64 of four Itanium microprocessors 62 is also
connected in a larger array comprised of four quads 58 of like processor arrays as depicted in
Fig. 5. The Itanium microprocessor used in the preferred embodiment is a device
manufactured by Intel.

RCM 14 acts as the control responsible for tracking the progress of all incoming
transactions initiated by the PCI bus 74 and scheduling and dispatching the appropriate data
responses. RCM 14 controls the data sequencing and steering through DOG 12 and
streamlines the data flow through the data crossbar system used to connect multiple processor
groups in either a single quad or multiple quad processor system configuration. DOG 12
provides data storage and routing between each of the major interface modules shown in Fig.
1. The heart of DOG 12 is a data buffer in which up to 64 cache lines of data can be stored.
Surrounding the storage area is logic hardware that provides for no-wait writes into and low
latency reads out of the data buffer within DOG 12.

Continuing with Fig. 1, when all inbound data from the F16 interface block 20 has been
moved, the inbound command block 18 also schedules the transaction in the inbound data
order queue (IDOQ) 24. IDOQ 24 makes the transaction available to the inbound data
handler (IDH) 26 for data movement. IDOQ 24 is responsible for keeping the proper order of
all data flowing from a given F16 bus 21 to a given control agent 66. Specifically, IDOQ 24
tracks and maintains the order of all inbound writes to memory 68, inbound responses to
outbound reads when processor 62 reads data on a PCI device, and inbound interrupts in the
system. In system 60, the control agents 66 must track the order of all inbound data, whether a
PCI device write or other data. IDOQ 24 keeps track of such data to maintain information
regarding the order of any data. IDH 26 schedules the transaction and moves the data from
IDQ 22 to DOG 12. After the data has moved, IDOQ 24 is notified and the resources used
by the transaction are freed to either the transaction manager 16 or the outbound command
block (not shown).

If the transaction being handled was a cacheable write, then RCM 14 will signal that all
of the invalidations have been collected. Once all of the ordering requirements have been met,
the inbound data complete arbiter 30 will signal RCM 14 that
the data is available to DOG 12. If the transaction being processed is a cacheable partial write,
then IDH 26 will signal DOG 12 to place the transaction in a cacheable partial write queue
where it awaits the background data.

Turning now to Fig. 2, Inbound Data Order Queue 24 is presented in operational
terms. IDOQ 24 has four main interfaces as shown in Fig. 2. Those interfaces are the Inbound
Command Block (ICB) 18, the Response Completion Manager (RCM) 14, the Inbound Data
Completion Arbiter (IDCA) 30 and the Inbound Data Handler (IDH) 26.

When a write operation is issued from F16 bus 21, it enters ICB 18 through F16
interface block 20. ICB 18 presents the write request to transaction manager 16, which is
located within control agent 66 and routes the write request to a bus within control agent 66
connected to tag and address crossbar 70, as more fully illustrated in Fig. 5. Tag and address
crossbar 70 receives the write operation, looks up tag information for the corresponding
address of the data and determines which agent 66 in quad processor system 60 must receive
an invalidate operation. As tag and address crossbar 70 issues an invalidate to other control
agents 66, tag and address crossbar 70 also issues a reply to the requesting agent 66 which
indicates how many acknowledgments it should receive. Further, tag and address crossbar 70
also signals the requesting control agent 66 if it must also issue an invalidate operation on a
processor bus 76 connected to that control agent 66. In the event that the affected control
agent 66 must issue an invalidate operation, such operation is treated like any other invalidate or
acknowledge pair command. When tag and address crossbar 70 provides a reply as indicated
above, the transaction involved is moved from ICB 18 to IDOQ 24. At this point, IDOQ 24
begins to track the order and location of the particular data being read from the PCI bus 21.

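
The tag lookup and reply described above might be modeled as in the following sketch. The dictionary used as a tag store, the InvalidateReply tuple and the function name are hypothetical, and the choice to count a local invalidate as one more expected acknowledgment is an assumption based on the statement that it is treated like any other invalidate and acknowledge pair.

```python
from typing import Dict, List, NamedTuple, Set


class InvalidateReply(NamedTuple):
    """Reply sent to the requesting control agent (hypothetical structure)."""
    remote_targets: List[int]   # other control agents that receive an invalidate
    invalidate_locally: bool    # requester must invalidate on its own processor bus
    acks_expected: int          # acknowledgments the requester should wait for


def lookup_and_reply(tags: Dict[int, Set[int]], address: int,
                     requester: int) -> InvalidateReply:
    """Look up which agents hold the line and build the reply for the requester."""
    holders = tags.get(address, set())
    remote = sorted(agent for agent in holders if agent != requester)
    local = requester in holders
    # Assumption: a local invalidate produces an acknowledgment like any other.
    return InvalidateReply(remote, local, len(remote) + (1 if local else 0))


# Line 0x1000 is cached by agents 0, 2 and 3; agent 0 issues the write.
tags = {0x1000: {0, 2, 3}}
reply = lookup_and_reply(tags, 0x1000, requester=0)
uncached = lookup_and_reply(tags, 0x2000, requester=1)
```

A line that no agent holds yields an expected count of zero, so the write would need only its data movement before it can complete.
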
In the event of an outbound read event, ICB 18 receives a data response from F16 bus
21 through F16 interface block 20 for a previously issued outbound read, for example, if a
processor is reading from a PCI card, and ICB 18 does not give the command to transaction
manager 16. Instead, ICB 18 forwards the command directly into IDOQ 24 as more fully
described below.

When tag and address crossbar 70 has given a reply to a new inbound data operation,
ICB 18 asserts a valid signal along with other information regarding that particular new inbound
data. Such additional information which is tracked includes the transaction identification (TrID)
for the transaction; whether the transaction is an inbound write request or a response to an
outbound read or an interrupt; whether the data is a full cache line or just a partial line of data;
and if the operation involves a partial line of data, the part of the cache line with which the data
is associated.

As ICB 18 forwards the operation to IDOQ 24, as shown in Fig. 1, other logic is
moving corresponding data into IDQ 22. IDQ 22 functions as a true first-in/first-out (FIFO),
so that the order of TrIDs given from ICB 18 to IDOQ 24 must match the order of data
loaded into IDQ 22.

When IDOQ 24 receives a new operation from ICB 18, such information as described
above is loaded into a register within IDOQ 24, shown more particularly in Fig. 2 as CAM
TrID register 84. In the preferred embodiment, there are eight queue locations in the register.
The register location which is loaded is the location pointed to by write pointer 86 in Fig. 2. For
example, if pointer 86 has a value of 3, then IDOQ 24's register 3 is loaded. Once such an
operation is written to the register, write pointer 86 is incremented so that the next operation
would go to the next queue location in turn. If write pointer 86 has a value of 7 and is then
incremented, it rolls over and begins at zero again. It will be appreciated by those skilled in the
art that using pointers in this manner is a known method used for implementing queues in a
variety of different queuing operations or procedures.
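
The roll-over behavior of write pointer 86 can be illustrated with modular arithmetic. The following is a minimal Python sketch; the constant and function names are chosen for illustration and the stored value is a placeholder.

```python
QUEUE_DEPTH = 8  # eight queue locations, as in the preferred embodiment


def advance(pointer: int) -> int:
    """Increment a queue pointer, rolling over from 7 back to 0."""
    return (pointer + 1) % QUEUE_DEPTH


registers = [None] * QUEUE_DEPTH

# With the write pointer at 3, register 3 is loaded and the pointer moves to 4.
write_pointer = 3
registers[write_pointer] = "TrID-A"  # placeholder transaction identification
write_pointer = advance(write_pointer)
after_one = write_pointer

# A pointer at the last location wraps to the first.
wrapped = advance(7)
```
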

When all of the acknowledgments have been received for a particular write operation,
RCM 14 asserts an (ACK) signal and provides a corresponding transaction identification
(TrID), shown at 88 in Fig. 1. IDOQ 24 recognizes the ACK signal provided by RCM 14
and then simultaneously compares the ACK TrID 88 to each TrID in the registers within
the IDOQ 24. The logic in IDOQ 24 then sets the corresponding ACK DONE data bit within
the register within IDOQ 24 that contains the TrID that matches ACK
TrID 88.
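
The content-addressable match can be sketched as follows. Hardware compares every register at once; the loop below models the same result sequentially. The dictionary layout of a register entry and the TrID values are assumptions for illustration only.

```python
def acknowledge(registers, ack_trid):
    """Set ACK DONE in every occupied register whose TrID matches the ACK TrID.

    Returns True if any register matched.
    """
    matched = False
    for entry in registers:
        if entry is not None and entry["trid"] == ack_trid:
            entry["ack_done"] = True
            matched = True
    return matched


# Two outstanding writes occupy the first two of eight registers.
registers = [None] * 8
registers[0] = {"trid": 0x11, "ack_done": False}
registers[1] = {"trid": 0x2A, "ack_done": False}

hit = acknowledge(registers, 0x2A)   # acknowledges arrive for the second write
miss = acknowledge(registers, 0x7F)  # no register holds this TrID
```

Because the match is by TrID rather than by queue position, acknowledges may arrive in any order without disturbing the order in which entries were loaded.
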

Continuing with Fig. 2, the functional interface between IDOQ 24 and Inbound Data
Handler (IDH) 26 will be described. IDOQ 24 supplies information to IDH 26 corresponding
to the next data line to be moved from the IDQ 22 to the DOG 12. IDOQ 24 passes on such
information it receives from ICB 18 for a given transaction, specifically those items described
above regarding pertinent information about a valid transaction asserted by ICB 18. IDH 26
does not differentiate whether such an operation is either a read or a write. Using the
information described above regarding the data, IDH 26 controls a transfer of data from IDQ
22 to DOG 12 appropriately.

In accordance with the description above, IDOQ 24 must provide the TrIDs to IDH
26 in the same order as such transaction identifications were received from ICB 18, as data
was loaded into IDH 26 in the same order and must be called up from IDH 26 in the correct
order. To implement the ordering, IDOQ 24 utilizes a move pointer 90, shown in Fig. 2. As
the present system is initialized or reset, move pointer 90 as well as write pointer 86 are set to
an initial value of zero. When IDOQ 24 is loaded by ICB 18, write pointer 86 is incremented
as earlier described. When the value of write pointer 86 is not equal to the value of move
pointer 90, IDOQ 24 is signaled that there is data to be moved and thereby asserts a valid
signal to IDH 26. As can be seen in Fig. 2, it is compare block 93 that asserts the valid signal
whenever the value of write pointer 86 and the value of move pointer 90 are not equal.
IDOQ 24 also supplies information from the IDOQ 24 registers
which are being identified by write pointer 86.

Once data has been moved from IDQ 22 to DOG 12, IDH 26 signals IDOQ 24 by
asserting a data moved signal shown at 94. When this occurs, move pointer 90 is incremented
to the next value. If there are no more valid entries in IDOQ 24, move pointer 90 will be equal
to write pointer 86 and the valid signal will be un-asserted. In the event that there is another
entry in IDOQ 24, write pointer 86 and move pointer 90 will not be equal and thus IDOQ 24
will maintain a valid condition and supply the TrID corresponding to the next IDOQ 84 register.

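
The relationship between write pointer 86, move pointer 90 and the valid signal can be sketched as below. This is an illustrative Python model with hypothetical names. It deliberately ignores the completely full case, in which two equal pointers would be ambiguous without additional state.

```python
DEPTH = 8


class MoveTracker:
    """Tracks which loaded entries still have data waiting to be moved."""

    def __init__(self):
        # Both pointers start at zero on initialization or reset.
        self.write_pointer = 0
        self.move_pointer = 0

    def load(self):
        """A new operation is loaded from the inbound command block."""
        self.write_pointer = (self.write_pointer + 1) % DEPTH

    def data_moved(self):
        """The data handler reports that one entry's data has been moved."""
        self.move_pointer = (self.move_pointer + 1) % DEPTH

    @property
    def valid(self) -> bool:
        """Valid is asserted whenever the two pointers differ."""
        return self.write_pointer != self.move_pointer


tracker = MoveTracker()
at_reset = tracker.valid

tracker.load()
tracker.load()
two_pending = tracker.valid

tracker.data_moved()
one_pending = tracker.valid

tracker.data_moved()
drained = tracker.valid
```
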
Continuing to consider Fig. 2, IDOQ 24 is also operatively connected to inbound data
completion arbiter (IDCA) 30. The operational connection described in Fig. 2 allows IDCA
30 to provide an indication that data for a particular TrID is available to be read by another
transaction. This available transaction must only be read, for a given TrID, once it has
been moved into DOG 12 and, if the transaction being considered was a write, IDOQ 24 must
have also received an acknowledged done (ACK DONE) indication for that TrID. The order
of the TrIDs given to IDCA 30 must be the same order as the TrIDs supplied originally by ICB
18. There is a separate IDOQ 24 for each F16 bus 21, each IDOQ 24 handling data from
separate PCI buses. Inbound Data Complete Arbiter 30 is responsible for looking at the data
item at the head of each IDOQ 24 and determining which, if any, of the data items can be sent to
the RCM 14. IDCA 30 selects from the output of the various IDOQs 24, which may be
producing data simultaneously. IDCA 30 determines when each IDOQ 24 can send data to
RCM 14.

Continuing to consider Fig. 2, available pointer 96 always points to the top available
position of the queue. Once pointer 96 is incremented past a particular data entry in the queue,
that entry is not considered valid. IDOQ 24 supplies IDCA 30 with information regarding the
value at the top of the queue. Such value is the value entered in the IDOQ 24 register which is
currently selected by available pointer 96. IDOQ 24 provides the TrID, such TrID's ACK
DONE bit as described above, and determines whether the data has been moved by
considering the move signal output 91 from compare block 92 shown in Fig. 2. The IDCA 30
will then assert a signal to RCM 14 (also shown as the connection between IDCA 30 and RCM
14 in Fig. 1) for a given TrID, once that TrID is supplied by IDOQ 24 and when both the
ACK and moved signals are asserted. If the operation is a read or an interrupt, as opposed to a
write operation, the ACK DONE bit will automatically be set so that IDCA 30 will only wait
for the data to be moved for that operation.
Once the ACK and moved signal 91 are both asserted for a given TrID, IDCA 30
signals RCM 14 that another transaction can read that data. Also at this time,
IDCA 30 signals that available pointer 96 should be incremented by transmission of the
increment signal 98 shown in Fig. 2. Signal 98 increments the available pointer 96 to the next
entry available in IDOQ 24. Move signal 91 is produced by compare block 92 whenever
available pointer 96 is not equal to move pointer 90.

When move signal 91 is issued, data for the TrID at the top of the IDOQ 24 to which
available pointer 96 has incremented has been moved into DOG 12. However, the affected
TrID has not been given to RCM 14 through IDCA 30 at that time. Once IDCA 30 has given
the TrID to RCM 14, available pointer 96 is incremented and that operation is no longer
considered in IDOQ 24.
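
The interplay of the three pointers and the ACK DONE bit can be sketched end to end as follows. This Python model is illustrative only: the class and method names are hypothetical, and retire_ready hands over every ready head entry in one call, whereas the arbiter described here passes one TrID per arbitration win.

```python
DEPTH = 8


class OrderQueue:
    """Software model of an order queue with write, move and available pointers."""

    def __init__(self):
        self.entries = [None] * DEPTH
        self.write = self.move = self.avail = 0

    def load(self, trid, is_write):
        # Reads and interrupts need no acknowledges, so ACK DONE starts set.
        self.entries[self.write] = {"trid": trid, "ack_done": not is_write}
        self.write = (self.write + 1) % DEPTH

    def data_moved(self):
        self.move = (self.move + 1) % DEPTH

    def ack(self, trid):
        for entry in self.entries:
            if entry is not None and entry["trid"] == trid:
                entry["ack_done"] = True

    def retire_ready(self):
        """Return TrIDs whose data has moved and whose ACK DONE bit is set, oldest first."""
        done = []
        # avail != move plays the role of the move signal for the head entry.
        while self.avail != self.move and self.entries[self.avail]["ack_done"]:
            done.append(self.entries[self.avail]["trid"])
            self.entries[self.avail] = None
            self.avail = (self.avail + 1) % DEPTH
        return done


q = OrderQueue()
q.load(1, is_write=True)
q.load(2, is_write=True)
q.load(3, is_write=False)   # a read response
for _ in range(3):
    q.data_moved()

q.ack(2)                    # acknowledges for the second write arrive first
first = q.retire_ready()    # head entry 1 is still unacknowledged

q.ack(1)
second = q.retire_ready()   # entries are released in their original order
```

Even though write 2 is fully acknowledged before write 1, nothing is released until the head of the queue is complete, which preserves the order in which the PCI device issued the operations.
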

It should be noted that in the event that all of the registers in IDOQ 24 are full (the IDOQ
24 in the preferred embodiment having 8 registers), and if all such registers have valid TrIDs,
IDOQ 24 asserts a PCI full signal 97 shown in Fig. 2. In this condition, signal 97 indicates to
ICB 18 that the IDOQ 24 cannot handle any more requests and therefore ICB 18 must not issue any
more operations to IDOQ 24 until a register is available. For inbound writes, once the TrID is
entered into IDOQ 24, and the data is moved into IDQ 22, control agent 66 issues a response back to
PCI input/output interface 80 that the PCI device can send more operations even though the
previous writes are not complete.
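
The back-pressure behavior of the full signal can be sketched as follows. Tracking occupancy with a counter is an assumption of this Python illustration, since the text does not state how the full condition is detected; the class and method names are hypothetical.

```python
class BoundedOrderQueue:
    """Models only the occupancy and full signal of an eight-register queue."""

    DEPTH = 8

    def __init__(self):
        self.occupied = 0  # assumption: occupancy tracked by a simple counter

    @property
    def full(self) -> bool:
        return self.occupied == self.DEPTH

    def load(self):
        """The command block may issue an operation only while full is not asserted."""
        if self.full:
            raise RuntimeError("queue full: no operation may be issued")
        self.occupied += 1

    def retire(self):
        """An entry has been handed on, freeing its register."""
        if self.occupied == 0:
            raise RuntimeError("queue empty")
        self.occupied -= 1


queue = BoundedOrderQueue()
for _ in range(BoundedOrderQueue.DEPTH):
    queue.load()
full_after_eight = queue.full

queue.retire()
full_after_retire = queue.full
```
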

Turning now to Fig. 3, the inbound data completion arbiter 30 will be described. 
When the transaction is both Valid and Acknowledged 34, then it enters arbitration for signaling 
to the RCM 14. The winner of this arbitration process is signaled to the RCM 14, and the 
corresponding IDOQ 24 is notified that the transaction has completed. The last PCI register 
indicates which of the IDOQs 24 has most recently won arbitration in the above process, and is 
used by Round Robin Arbiter 38. 
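
The use of the most recent winner by a round robin arbiter may be illustrated with the sketch below. The description does not give the exact rotation policy of Round Robin Arbiter 38, so a conventional round robin (search starting from the requester after the last winner) is assumed; the function name, the number of queues, and the list representation are likewise hypothetical.

```python
def round_robin_pick(requests, last_winner):
    """Return the index of the next requesting queue after last_winner.

    requests[i] is True when queue i has a transaction that is both
    valid and acknowledged. Returns None when no queue is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_winner + offset) % n
        if requests[candidate]:
            return candidate
    return None


# Three queues are assumed purely for the example.
winner_a = round_robin_pick([True, False, True], last_winner=0)
winner_b = round_robin_pick([True, False, False], last_winner=0)
winner_c = round_robin_pick([False, False, False], last_winner=1)
```

Starting the search one position past the last winner means a queue that has just been served is considered last, so that a single busy queue cannot starve the others under this assumed policy.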

Inbound Data Queue 22 is a memory that stores the inbound data and byte enables 
from either a write request or a read completion from F16 interface block 20. IDQ 22 is 
physically comprised of a memory storing the byte enables, and two memories storing the data 
associated with a 128 bit line of data with a 16 bit Error Correction Code (ECC) element 
attached. It should be understood that when reference is made to a 128 bit line of data, a 128 
bit data word with 16 bit ECC is included. Data is written to the IDQ 22 one data word of 
64 bits plus 8 bits of ECC at a time and is properly aligned with the 128 bit line of data by F16 
block 20. IDQ 22 is protected by ECC codes for the data and by parity for the byte enables. 
The ECC is generated by F16 interface block 20 and checked by the DOG 12 while the parity 
of the data is checked locally. 

Turning to Fig. 4 (presented in three parts as Figs. 4A, 4B and 4C for clarity but 
representing one diagram), an Inbound Data Timing Diagram for the IDB 10 illustrating a 
partial and full cache line write request is shown. Fig. 4 illustrates the timing sequence of both a 
partial and a full cache line write request for the entire Inbound Data Block 10. IDOQ 24 is 
maintaining the order of the data transfers. The transactions have the invalidates collected in the 
order two, one, and four, shown at ACK TrID 54, but the original data order is maintained as 
shown in the RCM moved TrID timing line 55 in Fig. 4. RCM 14 is not notified of 
transaction two's data movement until transaction one has had the invalidates collected. 
Transaction three in Fig. 4 was a read transaction and therefore does not require an 
acknowledgment for signaling the RCM 14. Finally, transaction four in the Fig. 4 Timing 
Diagram is not signaled to the RCM 14 until the data has been moved, since the 
acknowledgment was signaled a few clock cycles before it started to transfer. 
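
The ordering behavior described for Fig. 4 may be reproduced with a small event-driven sketch. The event sequence below is hypothetical: it only mirrors the relationships stated in the text (acknowledgments arriving in the order two, one, four; transaction three being a read; transaction four acknowledged before its data moves) and does not reproduce the clock-level timing of the figure. The function name and data layout are assumptions, and the arbitration step is omitted.

```python
def rcm_notification_order(queue, events):
    """Return (trid, event_index) pairs in the order the RCM is notified.

    queue:  list of (trid, needs_ack) in original arrival order.
    events: sequence of ("ack", trid) or ("moved", trid).
    """
    # Transactions that need no acknowledgment are treated as pre-acknowledged.
    acked = {trid for trid, needs_ack in queue if not needs_ack}
    moved = set()
    head = 0  # position of the oldest transaction not yet reported
    released = []
    for step, (kind, trid) in enumerate(events):
        (acked if kind == "ack" else moved).add(trid)
        # Only the head of the queue may be reported, preserving order.
        while head < len(queue) and queue[head][0] in acked and queue[head][0] in moved:
            released.append((queue[head][0], step))
            head += 1
    return released


queue = [(1, True), (2, True), (3, False), (4, True)]  # transaction 3 is a read
events = [
    ("moved", 1), ("ack", 2), ("moved", 2), ("ack", 1),
    ("moved", 3), ("ack", 4), ("moved", 4),
]
order = rcm_notification_order(queue, events)
```

In this run, transaction two is fully ready after the third event but is reported only once transaction one is acknowledged at the fourth event, and transaction four is reported only after its data moves, even though its acknowledgment arrived earlier.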

Advantages 

The preferred embodiment improves the logical sequencing of data writes of PCI 
devices in a multiprocessor system having a plurality of memory systems, each memory system 
associated with at least one processor but using a common data cache system and control 
system for all of the processors. The method described provides for overlapping data write 
processing so that processing of subsequent write commands issued by a PCI device can begin 
prior to the completion of previous write commands issued by a PCI device without corrupting 
the original transaction order req[...]. This 
overlapping results in increased system performance by increasing the number of transactions 
that can be processed in a given period. 

Alternatives 

The invention can be employed in any multiprocessor system that utilizes a central 
control agent for a group of microprocessors, although it is most beneficial when used in 
conjunction with a tagging and address crossbar system along with a data crossbar system 
which attaches multiple groups of processors employing non-uniform memory access and 
divided or distributed memory across the system. 

The particular systems and method which allow parallel processing of sequentially 
issued PCI device write commands through the device bus in a multiprocessor system as shown 
and described in detail are fully capable of obtaining the objectives of the invention. However, it 
should be understood that the described embodiment is merely an example of the present 
invention, and as such, is representative of subject matter which is broadly contemplated by the 
present invention. 

For example, the present invention is disclosed in the context of a particular system 
which utilizes 16 processors, comprised of four separate groups of four with each group of four 
assigned to a memory control agent which interfaces the PCI devices, memory boards 
allocated to the group of four processors, and for which the present invention functions to 
communicate through other subsystems to like controllers in the other three groups of four 
disclosed. Nevertheless, the present invention may be used with any system having multiple 
processors, with separate memory control agents assigned to control each separate group of 
multiprocessors when each group of processors requires coherence or coordination in handling 
data read or write commands for multiple peripheral devices utilizing various interface protocols 
for sequentially issued data writes from other device standards, such as ISA, EISA or AGP 
peripherals. 

The system is not necessarily limited to the specific numbers of processors or the array 
of processors disclosed, but may be used in similar system design using interconnected memory 
control systems with tagging, address crossbar and data crossbar systems to communicate 
between the controllers to implement the present invention. Accordingly, the scope of the 
present invention fully encompasses other embodiments which may become apparent to those 
skilled in the art, and is to be limited only by the claims which follow. 